The goal of this exploratory analysis is to investigate which chemical properties influence the quality of red wines. The data set contains 1,599 red wines with 11 variables describing their chemical properties. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent). The importance of the chemical compounds and the interplay between them will be investigated with regard to the experts’ ratings.
The following table displays the top rows of the data frame.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
Learning: The data set contains an index column “X”, which will be excluded from further inspection.
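Dropping the index column can be sketched as follows. The data frame here is only a toy stand-in built from the first three rows printed above and a few of their columns; the name `df` follows the code used later in this report.

```r
# Toy stand-in for the wine data: the first three rows printed above,
# restricted to a few columns to keep the example short.
df <- data.frame(
  X             = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  alcohol       = c(9.4, 9.8, 9.8),
  quality       = c(5L, 5L, 5L)
)

# Setting a column to NULL removes it from the data frame.
df$X <- NULL

names(df)
```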
The following table displays the column names and their respective data types.
## fixed.acidity volatile.acidity citric.acid
## "numeric" "numeric" "numeric"
## residual.sugar chlorides free.sulfur.dioxide
## "numeric" "numeric" "numeric"
## total.sulfur.dioxide density pH
## "numeric" "numeric" "numeric"
## sulphates alcohol quality
## "numeric" "numeric" "integer"
Learning: All columns in the data set are numeric. The 11 chemical properties are stored as floating-point values and contain continuous data; only the quality column is an integer.
The following section provides an overview of the distribution of the features.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution of fixed.acidity is right skewed. There are some outliers with a fixed.acidity higher than approx \(13g/dm^3\).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The distribution of volatile.acidity is slightly right skewed, with some outliers above approx. \(1.0g/dm^3\). Mean and median have nearly the same value.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution of citric.acid is right skewed. The values of Mean and Median are close together.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of residual.sugar is right skewed with a lot of outliers bigger than approx \(3.5g/dm^3\).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of chlorides is right skewed with a lot of outliers bigger than approx. \(0.15g/dm^3\).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The distribution of free.sulfur.dioxide is right skewed. There are several outliers with more than approx \(42mg/dm^3\).
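The outlier thresholds quoted in this section are consistent with the usual box plot rule (upper fence = Q3 + 1.5 × IQR); that this rule was used is an assumption, not something stated in the report. A minimal sketch using the quartiles of free.sulfur.dioxide from the summary above (the variable names are illustrative):

```r
# Quartiles of free.sulfur.dioxide as printed in the summary above.
q1 <- 7
q3 <- 21

# Box plot rule: values above Q3 + 1.5 * IQR are drawn as outliers.
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence
```

With these quartiles the fence lands at the 42 mg/dm^3 mentioned in the text.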
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The distribution of total.sulfur.dioxide is right skewed. There are several outliers with more than approx. \(120mg/dm^3\).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Except for some outliers, density is approximately normally distributed. Mean and median are therefore nearly identical.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH is also approximately normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulphates is right skewed with some outliers with more than \(1.0g/dm^3\).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution of alcohol is also right skewed. 50% of the wines have an alcohol concentration between 9.50% and 11.10% by volume.
The experts' ratings range from 3 to 8. About 1,300 wines (681 + 638 = 1,319) were rated 5 or 6. The mode is a quality of 5 with 681 ratings.
# Display counts of the discrete experts' quality rating values
table(df$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
# Detect maximum of quality ratings
which.max(table(df$quality))
## 5
## 3
Based on the data we can say that there is no clear relationship between fixed.acidity and quality. At first it seems that quality rises with higher fixed.acidity, but after rating 7 it decreases.
## Difference of features' means from lowest and highest rating: 0.2066667
Based on the data we can say that on average a higher volatile.acidity seems to have a negative impact on the experts' quality ratings.
## Difference of features' means from lowest and highest rating: -0.4611667
Based on the data we can say that on average the presence of citric.acid results in a higher quality rating.
## Difference of features' means from lowest and highest rating: 0.2201111
Based on the data on average it doesn’t look like residual.sugar has a big impact on the experts' quality ratings.
## Difference of features' means from lowest and highest rating: -0.05722222
Based on the data it seems that chlorides have little influence on quality. Wines with higher ratings contain slightly less chlorides on average.
## Difference of features' means from lowest and highest rating: -0.05405556
Based on the data we can say that on average high concentrations of free.sulfur.dioxide result in medium quality ratings while low concentrations lead either to poor or to good ratings.
## Difference of features' means from lowest and highest rating: 2.277778
As with free.sulfur.dioxide, on average high concentrations of total.sulfur.dioxide result in medium quality ratings while low concentrations lead either to poor or to good ratings.
## Difference of features' means from lowest and highest rating: 8.544444
Based on the data we can say that on average lower density results in better quality ratings.
## Difference of features' means from lowest and highest rating: -0.002251778
Based on the data we can say that on average there is a relationship between pH value and quality. The lower the pH the higher the experts' quality rating.
## Difference of features' means from lowest and highest rating: -0.1307778
Based on the data we can say that on average the higher the sulphates concentration the better the quality ratings.
## Difference of features' means from lowest and highest rating: 0.1977778
Based on the data we can say that on average the higher the alcohol concentration the better the experts' quality ratings.
## Difference of features' means from lowest and highest rating: 2.139444
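The "difference of features' means" reported above can be sketched as below. The direction (mean at the highest rating minus mean at the lowest rating) is an assumption that matches the signs reported; `mean_delta` is a hypothetical helper name, and the data frame is a toy stand-in built from the six rows printed at the top of the report, so the numbers differ from those of the full data set.

```r
# Toy stand-in: the six rows printed at the top, restricted to three columns.
df <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Hypothetical helper: mean of a feature at the highest rating
# minus its mean at the lowest rating.
mean_delta <- function(data, feature, rating = "quality") {
  lowest  <- min(data[[rating]])
  highest <- max(data[[rating]])
  mean(data[[feature]][data[[rating]] == highest]) -
    mean(data[[feature]][data[[rating]] == lowest])
}

deltas <- sapply(c("volatile.acidity", "alcohol"), mean_delta, data = df)
deltas
```

A positive value means the best-rated wines contain more of the feature on average, a negative value means they contain less.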
The data set contains 1,599 rows. Each row represents a red wine observation that consists of 11 continuous measurements of chemical properties and a discrete expert rating of the wine's quality. There are no missing values in the data set. Several of the attributes may be correlated. This is suggested by the information attached to the data set, by the naming of the variables (for example free.sulfur.dioxide and total.sulfur.dioxide), and by the behaviour of the features' box plots in the facet grid exploration.
Inspecting the faceted box plots visually, the following features seem to have an effect on the experts' quality ratings:
* fixed.acidity (positive effect)
* volatile.acidity (negative effect)
* citric.acid (positive effect)
* chlorides (negative effect)
* free.sulfur.dioxide (positive/negative effect)
* total.sulfur.dioxide (positive/negative effect)
* density (negative effect)
* pH (negative effect)
* sulphates (positive effect)
* alcohol (positive effect)
When comparing the features' means for good and bad ratings, the following features seem to be most important for the experts' rating (ordered from highest to lowest absolute difference):
* total.sulfur.dioxide: 8.544
* free.sulfur.dioxide: 2.278
* alcohol: 2.139
* volatile.acidity: -0.461
* citric.acid: 0.220
* fixed.acidity: 0.207
* sulphates: 0.198
* pH: -0.131
* residual.sugar: -0.057
* chlorides: -0.054
* density: -0.002
The interplay between the features will support the investigation in the following bivariate and multivariate steps. I expect the features with the biggest delta between good and bad ratings to play the biggest role for the experts' rating.
I didn’t create any new variables, but I did a facet grid exploration and a comparison of means between good and bad ratings.
There are right skewed distributions for several features. When inspecting the features using the facet grid, it turned out that especially the best and worst ratings were skewed.
The following table displays the Pearson correlation coefficient r for each pair of features. A coefficient of ±0.5 indicates a strong correlation, ±0.3 indicates a medium correlation, and ±0.1 indicates a weak correlation.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.0 -0.3 0.7
## volatile.acidity -0.3 1.0 -0.6
## citric.acid 0.7 -0.6 1.0
## residual.sugar 0.1 0.0 0.1
## chlorides 0.1 0.1 0.2
## free.sulfur.dioxide -0.2 0.0 -0.1
## total.sulfur.dioxide -0.1 0.1 0.0
## density 0.7 0.0 0.4
## pH -0.7 0.2 -0.5
## sulphates 0.2 -0.3 0.3
## alcohol -0.1 -0.2 0.1
## quality 0.1 -0.4 0.2
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.1 0.1 -0.2
## volatile.acidity 0.0 0.1 0.0
## citric.acid 0.1 0.2 -0.1
## residual.sugar 1.0 0.1 0.2
## chlorides 0.1 1.0 0.0
## free.sulfur.dioxide 0.2 0.0 1.0
## total.sulfur.dioxide 0.2 0.0 0.7
## density 0.4 0.2 0.0
## pH -0.1 -0.3 0.1
## sulphates 0.0 0.4 0.1
## alcohol 0.0 -0.2 -0.1
## quality 0.0 -0.1 -0.1
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.1 0.7 -0.7 0.2 -0.1
## volatile.acidity 0.1 0.0 0.2 -0.3 -0.2
## citric.acid 0.0 0.4 -0.5 0.3 0.1
## residual.sugar 0.2 0.4 -0.1 0.0 0.0
## chlorides 0.0 0.2 -0.3 0.4 -0.2
## free.sulfur.dioxide 0.7 0.0 0.1 0.1 -0.1
## total.sulfur.dioxide 1.0 0.1 -0.1 0.0 -0.2
## density 0.1 1.0 -0.3 0.1 -0.5
## pH -0.1 -0.3 1.0 -0.2 0.2
## sulphates 0.0 0.1 -0.2 1.0 0.1
## alcohol -0.2 -0.5 0.2 0.1 1.0
## quality -0.2 -0.2 -0.1 0.3 0.5
## quality
## fixed.acidity 0.1
## volatile.acidity -0.4
## citric.acid 0.2
## residual.sugar 0.0
## chlorides -0.1
## free.sulfur.dioxide -0.1
## total.sulfur.dioxide -0.2
## density -0.2
## pH -0.1
## sulphates 0.3
## alcohol 0.5
## quality 1.0
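A correlation table like the one above can be sketched with base R's `cor()`. The data frame is again a toy stand-in built from the six rows printed at the top of the report, so the coefficients will not match the table for the full data set.

```r
# Toy stand-in: the six rows printed at the top, restricted to three columns.
df <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Pairwise Pearson correlation coefficients, rounded to one decimal place.
r <- round(cor(df, method = "pearson"), 1)
r
```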
The p-value is the probability of observing a correlation at least as strong as the measured one if the two features were in fact uncorrelated. A p-value greater than 0.05 means that the correlation is not significant at the 5% level; a p-value less than 0.05 means it is significant.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 0.00000 0.00000 0.00000
## volatile.acidity 0.00000 0.00000 0.00000
## citric.acid 0.00000 0.00000 0.00000
## residual.sugar 0.00000 0.93892 0.00000
## chlorides 0.00018 0.01422 0.00000
## free.sulfur.dioxide 0.00000 0.67470 0.01474
## total.sulfur.dioxide 0.00001 0.00221 0.15555
## density 0.00000 0.37876 0.00000
## pH 0.00000 0.00000 0.00000
## sulphates 0.00000 0.00000 0.00000
## alcohol 0.01365 0.00000 0.00001
## quality 0.00000 0.00000 0.00000
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.00000 0.00018 0.00000
## volatile.acidity 0.93892 0.01422 0.67470
## citric.acid 0.00000 0.00000 0.01474
## residual.sugar 0.00000 0.02617 0.00000
## chlorides 0.02617 0.00000 0.82412
## free.sulfur.dioxide 0.00000 0.82412 0.00000
## total.sulfur.dioxide 0.00000 0.05809 0.00000
## density 0.00000 0.00000 0.38050
## pH 0.00061 0.00000 0.00487
## sulphates 0.82521 0.00000 0.03888
## alcohol 0.09258 0.00000 0.00549
## quality 0.58322 0.00000 0.04283
## total.sulfur.dioxide density pH sulphates
## fixed.acidity 0.00001 0.00000 0.00000 0.00000
## volatile.acidity 0.00221 0.37876 0.00000 0.00000
## citric.acid 0.15555 0.00000 0.00000 0.00000
## residual.sugar 0.00000 0.00000 0.00061 0.82521
## chlorides 0.05809 0.00000 0.00000 0.00000
## free.sulfur.dioxide 0.00000 0.38050 0.00487 0.03888
## total.sulfur.dioxide 0.00000 0.00435 0.00782 0.08602
## density 0.00435 0.00000 0.00000 0.00000
## pH 0.00782 0.00000 0.00000 0.00000
## sulphates 0.08602 0.00000 0.00000 0.00000
## alcohol 0.00000 0.00000 0.00000 0.00018
## quality 0.00000 0.00000 0.02096 0.00000
## alcohol quality
## fixed.acidity 0.01365 0.00000
## volatile.acidity 0.00000 0.00000
## citric.acid 0.00001 0.00000
## residual.sugar 0.09258 0.58322
## chlorides 0.00000 0.00000
## free.sulfur.dioxide 0.00549 0.04283
## total.sulfur.dioxide 0.00000 0.00000
## density 0.00000 0.00000
## pH 0.00000 0.02096
## sulphates 0.00018 0.00000
## alcohol 0.00000 0.00000
## quality 0.00000 0.00000
The following correlogram visualizes the correlation between the features. If the p-value is not significant, the correlation coefficient is set to 0.
The correlogram illustrates the relationships among the features themselves and between the features and quality. The values show the Pearson correlation coefficient, interpreted with the same thresholds as above (±0.5 strong, ±0.3 medium, ±0.1 weak). We can see that the alcohol concentration is positively correlated with quality. There is also a negative correlation of volatile.acidity with quality. We can also see that some features are correlated with other features, like free.sulfur.dioxide and total.sulfur.dioxide.
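The p-value table and the masking of non-significant coefficients used for the correlogram can be sketched with base R's `cor.test()`. Which function the report actually used is not shown, so this is only one possible approach; the data frame is a toy stand-in built from the six rows printed at the top of the report.

```r
# Toy stand-in: the six rows printed at the top, restricted to four columns.
df <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
)

r <- cor(df)

# p-value for every pair of columns; the diagonal is left at 0,
# as in the table above.
p <- matrix(0, ncol(df), ncol(df), dimnames = list(names(df), names(df)))
for (i in seq_len(ncol(df) - 1)) {
  for (j in (i + 1):ncol(df)) {
    p[i, j] <- p[j, i] <- cor.test(df[[i]], df[[j]])$p.value
  }
}

# Set coefficients that are not significant at the 5% level to 0.
r_masked <- r
r_masked[p > 0.05] <- 0
round(r_masked, 1)
```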
As seen from the correlogram, free.sulfur.dioxide and total.sulfur.dioxide are highly correlated with a Pearson correlation coefficient of 0.7. The positive correlation shows that with increasing free.sulfur.dioxide the total.sulfur.dioxide rises, too. This can also be seen in this scatter plot. The red dotted line visualizes the regression line for the data.
As seen from the correlogram, citric.acid and fixed.acidity are highly correlated with a Pearson correlation coefficient of 0.7. The positive correlation shows that with increasing citric.acid the fixed.acidity rises, too. This can also be seen in this scatter plot. The red dotted line visualizes the regression line for the data.
As seen from the correlogram, fixed.acidity and density are highly correlated with a Pearson correlation coefficient of 0.7. The positive correlation shows that with increasing fixed.acidity the density rises, too. This can also be seen in this scatter plot. The red dotted line visualizes the regression line for the data.
As seen from the correlogram, fixed.acidity and pH are highly correlated with a Pearson correlation coefficient of -0.7. The negative correlation shows that with increasing fixed.acidity the pH decreases. This can also be seen in this scatter plot. The red dotted line visualizes the regression line for the data.
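A scatter plot with a red dotted regression line, as described for the four pairs above, can be sketched in base R. The plotting library used for the original figures is not shown, so base graphics are an assumption here; the data frame is a toy stand-in built from the six rows printed at the top of the report.

```r
# Toy stand-in: fixed.acidity and pH of the six rows printed at the top.
df <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Simple linear regression of pH on fixed.acidity.
fit <- lm(pH ~ fixed.acidity, data = df)

# Scatter plot with the fitted line drawn as a red dotted line.
plot(df$fixed.acidity, df$pH, xlab = "fixed.acidity", ylab = "pH")
abline(fit, col = "red", lty = "dotted")

# The sign of the slope matches the sign of the correlation.
coef(fit)[["fixed.acidity"]]
```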
The correlation analysis validates a (strong) relationship between:
* quality <> alcohol
* quality <> volatile.acidity
* quality <> sulphates
The correlation analysis validates a strong positive relationship between:
* total.sulfur.dioxide <> free.sulfur.dioxide
* residual.sugar <> density
* sulphates <> chlorides
* citric.acid <> fixed.acidity
* fixed.acidity <> density
Furthermore, there is a strong negative relationship between:
* pH <> fixed.acidity
* pH <> citric.acid
* volatile.acidity <> citric.acid
* alcohol <> density
The strongest relationships are between:
* total.sulfur.dioxide <> free.sulfur.dioxide
* citric.acid <> fixed.acidity
* fixed.acidity <> density
* pH <> fixed.acidity
This scatter plot visualizes the relationship between free.sulfur.dioxide and total.sulfur.dioxide. The quality is visualized by color saturation. We are not able to detect a clear relationship with the quality ratings here.
This scatter plot visualizes the relationship between citric.acid and fixed.acidity. There seems to be a linear relationship between the two: the higher the concentration of citric.acid, the higher the fixed.acidity. Better wines seem to have more citric.acid and correspondingly higher fixed.acidity.
As in the previous plot, we can also detect a linear relationship between fixed.acidity and density. The higher the fixed.acidity, the higher the density. We are not able to detect a clear relationship with the experts' quality ratings, as good ratings can be found with small fixed.acidity / density values and with relatively high ones, too.
This scatter plot visualizes the relationship between fixed.acidity and pH (which are negatively correlated). Although the relationship between these values is linear, we cannot detect any relationship with the quality ratings, as good ratings can be found with small fixed.acidity / pH values and with relatively high ones, too.
This scatter plot visualizes the relationship between alcohol and sulphates. The color saturation shows that good wines tend to have higher alcohol and higher sulphates concentrations.
This scatter plot visualizes the relationship of alcohol and volatile.acidity with the experts' quality ratings. Higher alcohol concentration tends to go with better ratings. Lower concentration of volatile.acidity also tends to go with better ratings, but the relationship doesn’t seem as strong as for alcohol.
This scatter plot visualizes the relationship between alcohol and citric.acid. Based on the data we can say that higher concentrations of alcohol and of citric.acid go with better quality ratings from the experts.
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68911 -0.36652 -0.04699 0.45202 2.02498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.9652084 21.1945750 1.036 0.3002
## fixed.acidity 0.0249906 0.0259485 0.963 0.3357
## volatile.acidity -1.0835903 0.1211013 -8.948 < 0.0000000000000002
## citric.acid -0.1825639 0.1471762 -1.240 0.2150
## residual.sugar 0.0163313 0.0150021 1.089 0.2765
## chlorides -1.8742252 0.4192832 -4.470 0.00000837395338361
## free.sulfur.dioxide 0.0043613 0.0021713 2.009 0.0447
## total.sulfur.dioxide -0.0032646 0.0007287 -4.480 0.00000800460981846
## density -17.8811638 21.6330999 -0.827 0.4086
## pH -0.4136531 0.1915974 -2.159 0.0310
## sulphates 0.9163344 0.1143375 8.014 0.00000000000000213
## alcohol 0.2761977 0.0264836 10.429 < 0.0000000000000002
##
## (Intercept)
## fixed.acidity
## volatile.acidity ***
## citric.acid
## residual.sugar
## chlorides ***
## free.sulfur.dioxide *
## total.sulfur.dioxide ***
## density
## pH *
## sulphates ***
## alcohol ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
## F-statistic: 81.35 on 11 and 1587 DF, p-value: < 0.00000000000000022
From the linear regression model we can say that volatile.acidity, chlorides, total.sulfur.dioxide, sulphates and alcohol are the most important features for predicting the experts' quality ratings.
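The selection above corresponds to the predictors marked `***` in the model summary, that is, those with a p-value below 0.001. A minimal sketch using the p-values printed in the summary (the vector name `p_values` is illustrative; for a fitted model object they could be read from `summary(model)$coefficients[, "Pr(>|t|)"]`):

```r
# p-values of the predictors as printed in the regression summary above.
p_values <- c(
  fixed.acidity        = 0.3357,
  volatile.acidity     = 2e-16,   # printed as "< 2e-16"; upper bound used here
  citric.acid          = 0.2150,
  residual.sugar       = 0.2765,
  chlorides            = 8.37e-06,
  free.sulfur.dioxide  = 0.0447,
  total.sulfur.dioxide = 8.00e-06,
  density              = 0.4086,
  pH                   = 0.0310,
  sulphates            = 2.13e-15,
  alcohol              = 2e-16    # printed as "< 2e-16"; upper bound used here
)

# Predictors significant at the 0.001 level (marked "***" in the summary).
significant <- names(p_values)[p_values < 0.001]
significant
```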
According to the Pearson correlation analysis, the strongest correlations with quality were found for alcohol, volatile.acidity and sulphates. While alcohol had the strongest relationship with quality, in combination with sulphates and citric.acid the quality ratings seem even better.
The influence of the combination of citric.acid and alcohol on the experts' quality rating was surprising to me.
I trained a linear regression model. The most important features for the regression are:
* volatile.acidity
* chlorides
* total.sulfur.dioxide
* sulphates
* alcohol
All of these features already stood out during the facet grid examination.
The first chart visualizes the relationship between the alcohol concentration in red wine and the experts' quality rating of the wine. Although there is a gap at the rating of 5, a clear linear relationship can be observed.
This scatter plot visualizes the relationship between alcohol and sulphates. The color saturation shows that good wines tend to have higher alcohol and higher sulphates concentrations.
This scatter plot visualizes the relationship between alcohol and citric.acid. Based on the data we can say that higher concentrations of alcohol and of citric.acid go with better quality ratings from the experts.
In the first step I inspected the histograms to get a first idea of the data. From these alone it was not possible to draw conclusions about which ingredients are responsible for good ratings. Breaking down the data using facet grid exploration helped to gain these insights: it became clear which features seem to influence the ratings in a positive or negative way. To find correlations among the features themselves and with the experts' rating, a Pearson correlation analysis was carried out. Alcohol showed the biggest direct influence on the quality ratings. It was surprising that alcohol was the only feature with a big influence on the experts' ratings. It was difficult to detect supporting features just by looking at the scatter plots. Therefore a linear regression model was fitted to help find the supporting features. With the help of the most important features of the linear regression model, the final plots were prepared and selected.
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
12 - quality (score between 0 and 10)
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data):
12 - quality (score between 0 and 10)